Document Clustering with Dual Supervision

نویسنده

  • Yeming Hu
چکیده

Nowadays, academic researchers maintain a personal library of papers, which they would like to organize based on their needs, e.g., research, projects, or courseware. Clustering techniques are often employed to achieve this goal by grouping the document collection into different topics. Unsupervised clustering does not require any user effort but only produces one universal output with which users may not be satisfied. Therefore, document clustering needs user input for guidance to generate personalized clusters for different users. Semi-supervised clustering incorporates prior information and has the potential to produce customized clusters. Traditional semisupervised clustering is based on user supervision in the form of labeled instances or pairwise instance constraints. However, alternative forms of user supervision exist such as labeling features. For document clustering, document supervision involves labeling documents while feature supervision involves labeling features. Their joint use has been called dual supervision. In this thesis, we first explore and propose a framework to use feature supervision for interactive feature selection by indicating whether a feature is useful for clustering. Second, we enhance the semi-supervised clustering with feature supervision using feature reweighting. Third, we propose a unified framework to combine document supervision and feature supervision through seeding. The newly proposed algorithms are evaluated using oracles and demonstrated to be more helpful in producing better clusters matching a single user’s point of view than document clustering without any supervision or with only document supervision. Finally, we conduct a user study to confirm that different users have different understandings of the same document collection and prefer personalized clusters. At the same time, we demonstrate that document clustering with dual supervision is able to produce good personalized clusters even with noisy user input. Dual supervision is also demonstrated to be more effective in personalized clustering than no supervision or any single supervision. We also analyze users’ behaviors during the user study and present suggestions for the design of document management software.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Clustering Document with Active Learning using Wikipedia

Wikipedia has been applied as a background knowledge base to various text mining problems, including document categorization, topic indexing and information extraction. However, very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit Wikipedia and the semantic knowledge therein to facilitate clustering, enabling the automatic grouping of docum...

متن کامل

Regularized Co-Clustering with Dual Supervision

By attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, Co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided row clustering techniques. A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization ...

متن کامل

A Cluster-Level Semi-supervision Model for Interactive Clustering

Semi-supervised clustering models, that incorporate user provided constraints to yield meaningful clusters, have recently become a popular area of research. In this paper, we propose a cluster-level semi-supervision model for inter-active clustering. Prototype based clustering algorithms typically alternate between updating cluster descriptions and assignment of data items to clusters. In our m...

متن کامل

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...

متن کامل

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012